Discovery engine

ABSTRACT

A method that is relatively inexpensive to implement and that permits a user to conduct searches of electronically stored documents using an entire document, multiple documents or portions of a document as the search criteria and to collect, store and to share the relevant documents from the search.

CROSS REFERENCE TO RELATED APPLICATIONS

The present application is a continuation and has benefit of priority of U.S. patent application Ser. No. 14/199,985, titled “Discovery Engine”, filed on Mar. 6, 2014, which priority application is a continuation-in-part of and has benefit of priority of U.S. patent application Ser. No. 13/441,123 filed Apr. 6, 2012 entitled “Discovery Engine”, having a common inventor herewith and being incorporated herein in its entirety by reference.

U.S. patent application Ser. No. 14/199,985, titled “Discovery Engine”, filed on Mar. 6, 2014, also claims priority of U.S. Provisional Patent Application No. 61/850,987 filed Feb. 27, 2013 entitled “Discovery Engine”, having a common inventor herewith and being incorporated herein in its entirety by reference.

FIELD OF THE DISCLOSURE

The disclosures made herein relate generally to the search engine and discovery engine industry. Embodiments of the present invention disclosed herein are in the general classification of a device and methodology for conducting a search of electronically stored documents and collecting, storing and sharing the related documents found through the search.

BACKGROUND

This section introduces aspects that may be helpful in facilitating a better understanding of embodiments of the present invention. Accordingly, the statements of this section are to be read in this light and are not to be understood as admissions about what is in the prior art or what is not in the prior art.

Most individuals are familiar with manual searching for books, magazines or documents in a library or similar setting. Searching, in its most rudimentary form, often simply involves a researcher seeking a specific book written by a particular author by perusing the library stacks by category type and utilizing alphabetical order or some other organizational scheme to locate the specific book.

Searching for documents stored electronically often involves searching within a specific database via names or key words/search terms. When a researcher must independently search each database, he will only uncover documents stored in the selected database that relate to the search terms, and he will not uncover any related documents stored in other databases. This creates an organizational problem in that different researchers may search different databases attempting to find the same type of documents. In other words, two different researchers may think that a given document they are searching for should be contained in two different databases due to their own notions of the proper categorization of the searched for document. As a result, one or both researchers may not discover the document that they are searching for due to their failure to classify the document in the same manner as the creator of the database and their failure to search the database deemed appropriate by the database creator.

With the advent of the Internet, millions of documents are available through Internet search engines. An electronic document is a cohesive body of text that is electronically accessible (e.g. a patent document, a news article, a legal case, a medical journal article or a webpage). Often, a group of documents are contained within a single source, dataset, collection or database. Most individuals are familiar with the process of searching for relevant documents within a document collection via keywords and search terms. A researcher types the key words/search terms into the search engine to locate related documents and then sifts through the document results to determine which documents are most relevant.

If the researcher is satisfied with the results he obtains via the key word search, he can print or save the documents and complete the search. However, often the researcher is not satisfied with the initial results and the query (i.e. key words or search terms) must be modified to obtain potentially better results. After a number of searches are performed, the researcher often collects and organizes the results by printing the documents or saving the documents into a folder. The problem with this searching methodology is twofold. First, the results of the search are dependent on the researcher's selection of key words. The researcher may not select the best key words or may not be able to obtain the best results by simply using a few words (i.e., search terms) and may obtain no results by using too many terms. Second, the document results saved or printed are not “living” documents in that they represent how the document appeared when the document was saved or printed. They are not dynamic and capable of being updated and then viewed at a later date without further researcher involvement. The document results are also a snapshot of the search conducted at a given point in time and any documents added to the dataset after the search will not be included in the search results.

Keyword searching is still quite analogous to manually investigating a collection of printed documents. Software essentially just helps to perform that job more efficiently. The advent of the search engine was a cornerstone in the evolution of information research, but a search engine simply finds documents that contain some specific words.

Advanced search engines such as Google are forgiving in the sense that they can yield results that do not literally match on the keywords and allow the researcher to utilize natural language. Search engines, such as Google, utilize a “Page Rank” that may skew results from any given search. “Page Rank” involves a link analysis algorithm that assigns a value to each element of a set of documents to determine a document's relative importance within the set of documents. The value assigned to a document/webpage on the World Wide Web is defined recursively and is calculated based on the number and “Page Rank” of all webpages that link to the document with the theory being that a document linked to by many webpages with high “Page Ranks” is also worthy of a high “Page Rank.”

Semantics also play a role in natural language queries, in which “unimportant” words such as “the” and “it” are discarded while the “important” words and synonyms of those “important” words are actually searched. This may ultimately create a huge index that still needs to be manually inspected by the researcher.

Other database search engines (e.g. search engines for Wikipedia and the United States Patent and Trademark Office) utilize the familiar “Boolean keyword search” that is very literal and has its own distinct value and applicability. If a researcher types in too many keywords, no matches appear. If a researcher types in too few keywords, there are too many and highly varying results. If a researcher is unsatisfied with the results, he must rework the query by adding some complex operators (e.g. some combination of “AND”, “OR”, “NOT”, and/or parentheses).

If a researcher is unfamiliar with the nuances of the Boolean keyword search system, he may not properly utilize the Boolean operators and may not structure the query in the proper manner to obtain the most desirable results. Moreover, a Boolean search is traditionally unforgiving in that the search terms entered are either present or they are not present in the selected range (e.g. in the entire document or in the same sentence as one another).

Key word searching also may be difficult to perform in certain situations because of the different meanings of given words (e.g. China and china), causing a large number of varying search results that need to be perused by a researcher.

Traditional search solutions do not allow for electronic searching for documents utilizing an entire document or documents as the search criteria or utilizing portions of a document supplemented with key words entered by a researcher as the search criteria. For example, if one were to copy an entire document and stick it into Google, Bing or Yahoo searches, one would get an error message because these search engines are not designed to search entire documents. There are a few search engines that do semantic searches of entire documents such as Text Wise. However, these prior art full document semantic search engines are suboptimal because they utilize logic based systems that require things such as proximity searches for words (e.g. is the word “horse” within two words of the word “shoe”), Boolean logic (e.g. AND, OR, AND NOT) and attempts to understand the meaning of words by associating the words with other words using logic (e.g. the word “china” may be related to kitchenware if the word “porcelain” or “plate” is also used in the same document). This type of prior art semantic search using logic, Boolean logic and proximity is computationally difficult and it increases both the time and money required to perform searches and to index groups of documents to be searched.

Other solutions also do not permit collection, storage and sharing of the documents found during this type of searching in a portable and dynamic manner.

The prior art searching technology simply allows a researcher to enter some keywords for searching that may yield a set of documents that at least come close to the type of documents sought. Upon reviewing these documents, if a researcher discovers some words in a related document that help him develop his search criteria, the prior art solutions require him to enter those key words from that related document as search terms to try to locate additional relevant documents. The context of the language preceding and following those key words from the related document is lost when a new key word search is performed using this traditional searching technique. The prior art does not allow the researcher to leverage the entirety of that particular related document as the criteria for the next search.

In many document collections, the highest quality search criterion is actually the entire text of one of the documents in the database. A real document in the collection (or a new one that the researcher types in full) contains much more useful information than what a researcher typically types as keywords. The natural language of the document and all of its inherent properties tend to shine through, if analyzed with appropriate algorithms. When the text of an entire document or large portions of text thereof are used as the search criteria, the set of related documents returned are most similar to or related to the original document or portions thereof. In “complexity theory” this phenomenon is known as “emergence.” Emergence is the key to a natural stepping-stone in the evolution of information research from a “search engine” to a “discovery engine.”

A researcher conducting a document search, such as a patent search, could leverage a “discovery engine” as opposed to a “search engine” to obtain superior results. In this type of search, the researcher already has a full description of the patent/document. The description can be submitted as the search criteria and the top related documents can be returned. Some of the results may look very relevant and the researcher can hold/identify these documents to enable him to return to them later. The researcher also can identify others to ignore so they do not show up as results again. If one of the documents discovered looks extremely relevant, the researcher can perform a further search using that entire relevant document as the search criteria to view the top related documents to that relevant document. The search criteria are effectively changing each time a search is performed without having to rework a query manually each time based on search results.

Hence, there is a need for a device and methodology that efficiently, reliably and affordably permit a user to utilize the text of an entire document as the search criteria and/or to utilize an entire document along with supplemental text supplied by a researcher and/or multiple documents or subsections of documents as the search criteria and/or any combination of these potential search criteria. There is also a need for a device and methodology that permit a user to collect, store and share the collected/related documents from a search with other users and to further permit any individual to conduct an updated search for any newly added documents in a dataset based on the same search criteria.

SUMMARY OF THE DISCLOSURE

A device configured in accordance with a preferred embodiment of the present invention includes a memory containing a set of instructions and a processor for processing the set of instructions. The set of instructions includes instructions for selecting at least one category of sources (either automatically or through user selection); selecting at least one source (i.e. a collection of documents) within at least one category of sources (either automatically or through user selection); utilizing search terms to search the at least one source (assuming one does not already have a document usable as the search criteria); returning related documents from the at least one source based on the search terms; collecting any of the related documents into a collection; permitting at least one related document returned to be selected for a further search utilizing the at least one related document as the search criteria in a selected source to return additional related documents (assuming one does not already have a document usable as the search criteria); and exporting the collection of related documents by creating a Uniform Resource Locator (URL) with all of the collected related documents stored at a location referenced in the URL.

In preferred embodiments of the present invention, a set of instructions may also include instructions for utilizing a document such as a webpage to automatically conduct a search of designated sources for documents related to the document based on the content of the document and instructions for displaying the related documents found in the search in a collection and storing the collection under a single URL that can be utilized to display the collection.

In one embodiment of the present invention, a system comprises a memory containing a set of instructions and a processor for processing the set of instructions. The instructions cause the processor to perform a method comprising a plurality of operations. An operation is performed for receiving a current instance of search criteria, followed by an operation being performed for determining tokens in the current instance of the search criteria. For each document of at least one dataset, an operation is performed for determining each token that has at least one occurrence thereof within the current instance of the search criteria and within the document. For each document of the at least one dataset, an operation is performed for generating a similarity score indicating a degree of relevance of contents of the document to the current instance of the search criteria. Generating the similarity score includes characterizing similarity based on a number of times each token is present in both the document and the current instance of the search criteria and based on uniqueness of each token with respect to each other token.

In another embodiment of the present invention, a non-transitory computer-readable medium has tangibly embodied thereon and accessible therefrom processor-executable instructions that, when executed by at least one data processing device of at least one computer, cause said at least one data processing device to perform a method comprising a plurality of operations. An operation is performed for receiving a current instance of search criteria, followed by an operation being performed for determining tokens in the current instance of the search criteria. For each document of at least one dataset, an operation is performed for determining each token that has at least one occurrence thereof within the current instance of the search criteria and within the document. For each document of the at least one dataset, an operation is performed for generating a similarity score indicating a degree of relevance of contents of the document to the current instance of the search criteria. Generating the similarity score includes characterizing similarity based on a number of times each token is present in both the document and the current instance of the search criteria and based on uniqueness of each token with respect to each other token.

In another embodiment of the present invention, a non-transitory computer-readable medium has tangibly embodied thereon and accessible therefrom processor-executable instructions that, when executed by at least one data processing device of at least one computer, cause said at least one data processing device to perform a method comprising a plurality of operations. An operation is performed for receiving a current instance of search criteria. The current instance of the search criteria includes a uniform resource locator (URL). An operation is performed for determining tokens in the current instance of the search criteria. For each document of at least one source of documents, an operation is performed for performing a first frequency count for characterizing a number of times that each one of the tokens occurs within the text used as the current instance of the search criteria in comparison to each one of the documents in the at least one source of documents. For each one of the tokens, an operation is performed for performing a second frequency count for characterizing an aggregate number of times that a particular one of the tokens occurs within all of the documents in the at least one source of documents. For each document in the at least one source of documents, an operation is performed for generating a similarity score between the text used as the current instance of the search criteria and a particular one of the documents, wherein the similarity score is a function of the first frequency count for the particular one of the documents and the second frequency count for each token in the particular one of the documents.

In accordance with embodiments of the present invention, a preferred methodology for searching a collection of electronically stored documents when one does not already have a document available to serve as the search criteria involves: (1) selecting at least one category of sources; (2) selecting at least one source (i.e. a collection of documents) within at least one category of sources; (3) utilizing search terms to search the at least one source; (4) returning related documents from the at least one source based on the search terms; (5) collecting any of the related documents into a collection; (6) permitting at least one related document returned to be selected for a further search utilizing the at least one related document as the search criteria in a selected source to return additional related documents; and (7) creating a URL with all of the collected related documents stored at a location referenced in the URL. In at least one embodiment of the present invention, the at least one related document selected as the search criteria is the text from a URL.

The step of collecting any of the related documents into a collection may involve identifying the related documents to be collected from each source. The step of collecting documents may be performed to collect additional related documents after any search.

Some embodiments of the present invention may further involve sharing the relevant documents by sending the URL to select other users via any electronic method, including social networking websites and electronic mail services.

Some embodiments of the present invention may involve utilizing a document such as a webpage to automatically conduct a search of designated sources for documents related to the document based on the contents of the document and displaying the related documents found in the search in a collection and storing the collection under a single URL that can be utilized to display the collection.

Some embodiments of the present invention may provide a method that is relatively inexpensive to implement and that permits a user to conduct searches of electronically stored documents using an entire document, multiple documents or portions of a document as the search criteria and to collect, store and to share the relevant documents from the search.

Some embodiments of the present invention may provide a device and method that are not operationally complex that permit a user to efficiently and effectively conduct searches of electronically stored documents using an entire document (which may or may not be the text of a URL itself), multiple documents or portions of a document as the search criteria and to collect, store and share the relevant documents from the search.

Some embodiments of the present invention may provide better search results than traditional Boolean or natural language searches utilizing only search terms input by a user by utilizing an entire document, documents, portions of documents or portions of documents supplemented with user input search terms.

Some embodiments of the present invention may provide more conveniently collected, stored and shared documents and document collections.

Some embodiments of the present invention may provide more dynamic search results that can be constantly updated without user involvement due to the nature of the searching and storage of the search results.

Some embodiments of the present invention may provide searching technology that does not require the use of an extreme amount of computer resources because the indexing of documents and the searching of documents in accordance with embodiments of the present invention are done in a manner which does not involve proximity searching, Boolean logic and/or other types of logic searching (e.g. trying to understand the meaning of words by looking for associations with other words). In the context of the disclosures made herein, indexing refers to a mechanism to collect and store information that is used in the process of information retrieval.

Some embodiments of the present invention may provide a computing process that indexes documents in the electronically stored documents (i.e. dataset) by counting words/tokens in each document in the stored documents; determining a numeric value that measures the magnitude of significance for each unique word/token in the dataset as a whole (i.e. common tokens/words in the dataset as a whole are less significant); using a numeric representation of the frequency of unique words/tokens in the document being used as the search criteria and multiplying a frequency of each unique word/token in the search document by the numeric value of its significance magnitude (i.e., the numeric value representing its magnitude of significance) so that each unique word/token in the search document has a numeric value representing the product of both its frequency of use factor (which may be a logarithmic transformation of the raw number) and its significance factor. After this numeric significance/frequency product is calculated for each unique word/token in the search criteria document, one can then compare the numeric significance/frequency product for other documents in the dataset with that of the search document to compare aggregate similarities (i.e. overlap of words/tokens of importance).
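By way of illustration only, the significance/frequency product described above can be sketched in Python as follows. The function names, the `1 + log(n)` damping and the use of a sum of products over shared tokens are assumptions made for this sketch; the disclosure does not fix a particular formula.

```python
import math
from collections import Counter


def token_weights(tokens, significance):
    """Frequency factor times significance factor for each unique token.

    `significance` is a hypothetical mapping of token -> numeric significance;
    tokens missing from it are treated as having no significance.
    """
    counts = Counter(tokens)
    # 1 + log(n) is one possible logarithmic transformation of the raw count.
    return {t: (1 + math.log(n)) * significance.get(t, 0.0)
            for t, n in counts.items()}


def similarity(query_tokens, doc_tokens, significance):
    """Aggregate overlap of important tokens between two token lists."""
    q = token_weights(query_tokens, significance)
    d = token_weights(doc_tokens, significance)
    # Only tokens present in both the search document and the candidate count.
    return sum(q[t] * d[t] for t in q.keys() & d.keys())
```

In this sketch a candidate document that shares rare tokens with the search document scores higher than one that shares only common tokens, and a document with no tokens in common scores zero.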

Some embodiments of the present invention may be implemented in the form of a methodology that forms a paradigm that is fundamentally sound and extensible (e.g. multiple documents, an existing document that is augmented with some text supplied by the researcher or subsections of documents can be used as the search criteria to point to other related documents).

BRIEF DESCRIPTION OF THE DRAWINGS

Some embodiments of systems, apparatuses and/or methods configured in accordance with the present invention are now described, by way of example only, and with reference to the accompanying drawings.

FIG. 1 depicts a device (e.g., a system or apparatus) configured in accordance with an embodiment of the present invention for implementing a method of searching a source of electronically stored documents and collecting, storing and sharing the related documents from the search.

FIG. 2 depicts a screen shot of an exemplary sign-in webpage used in accessing a website to perform a methodology configured in accordance with an embodiment of the present invention for searching a source of electronically stored documents.

FIG. 3 depicts an interactive webpage for use in setting up a search environment in an embodiment of the present invention.

FIG. 4 depicts an interactive webpage for use in conducting a search associated with a preferred embodiment of the present invention.

FIG. 5 depicts another interactive webpage for use in conducting a search in accordance with an embodiment of the present invention.

FIG. 6 depicts another interactive webpage for use in conducting a search in accordance with an embodiment of the present invention.

FIG. 7 depicts a webpage displaying a collection with each document/webpage having a table associated therewith.

FIG. 8 depicts a webpage displaying an electronic magazine (electronic document collection) that can be stored and shared via a single URL.

FIG. 9 depicts the methodology configured in accordance with an embodiment of the present invention for searching, collecting, storing and sharing electronic documents.

FIG. 10 depicts a methodology configured in accordance with an embodiment of the present invention for creating an electronic collection of related documents from a variety of sources based on the entire text of a single electronic document.

FIG. 11 depicts a method configured in accordance with an embodiment of the present invention for enabling search results for multiple datasets to be compared using normalized similarity scores for documents of the datasets.

FIG. 12 is a diagrammatic representation showing an embodiment of a computer system configured in accordance with an embodiment of the present invention.

DETAILED DESCRIPTION OF THE DRAWINGS

It is contemplated that a method described herein, which is configured in accordance with an embodiment of the present invention, can be implemented as software, including a computer-readable medium (e.g., a non-transitory computer-readable medium) having program instructions executing on a computer, hardware, firmware, or a combination thereof. The method described herein also may be implemented in various combinations on hardware and/or software.

FIG. 1 depicts a device 10 configured in accordance with an embodiment of the present invention for implementing the method of searching a source of electronically stored documents and collecting, storing and sharing the related documents from the search. The device 10 has a memory 12 containing a set of instructions 13 and a processor 11 for implementing the set of instructions 13. The set of instructions 13 may include instructions for: allowing a user to sign into an account using any of a plurality of approved website accounts; selecting at least one category of sources; selecting at least one source (i.e. a collection or set of documents) within at least one category of sources; utilizing search terms/criteria to search the at least one source; returning related documents from the at least one source based on the search terms; collecting any of the related documents into a collection; permitting at least one related document returned to be selected for a further search utilizing the content of the at least one related document as the search criteria in a selected source to return additional related documents; creating a Uniform Resource Locator (URL) with the collection stored at a location referenced in the URL; and exporting the collection of related documents by sending the URL associated with the collection to other selected users via any electronic method, including social networking websites and electronic mail services.

In preferred embodiments of the present invention, default categories and sources are utilized, allowing the categories and sources to be automatically selected without user involvement.

The set of instructions 13 may further include instructions wherein collecting any of the related documents into a collection involves identifying the related documents to be stored from the at least one source.

Alternatively, the set of instructions 13 may include instructions for utilizing a document such as a webpage to automatically conduct a search of designated sources for documents related to the document based on the content in the document and instructions for displaying the related documents found in the search in a collection and storing the collection under a single URL that can be utilized to display the collection.

In conducting a search of a source, also known as a dataset, for other documents related to a document based on the content/text in the document, a computing process is run that assimilates the entire dataset to prepare the necessary data structures. Thereafter, any document, whether it resides in the dataset or not, can have its similarity score calculated against every document in the dataset.

For each document in a dataset, the computing process should parse out the relevant text from any markup. For example, if the markup language in a document contains <title>Searching Techniques</title>, the relevant text “Searching Techniques” is parsed out and the opening and closing <title> tags in the markup language are removed. By further way of example, parsing out relevant text may also involve only utilizing the text of a blog article and ignoring the comments contained in the blog. Often the comments in a blog are drafted by numerous authors, resulting in inconsistent and/or inaccurate term usage. Hence, a researcher may determine that counting the appearances of certain terms/words by including the blog comments may not increase the accuracy of search results.
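By way of illustration only, the parsing step can be sketched with the `html.parser` module of the Python standard library. The names `TextExtractor` and `extract_text` are hypothetical, and the sketch makes no attempt to separate a blog article from its comments or to skip script and style content.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects the text between tags and discards the tags themselves."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Invoked for text content only; tags never reach this method.
        self.parts.append(data)


def extract_text(markup):
    """Return the text of `markup` with tags removed and whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    return " ".join(" ".join(parser.parts).split())
```

For example, `extract_text("<title> Searching Techniques </title>")` yields the string `Searching Techniques`.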

The computing process should also lowercase or uppercase all letters in the text of documents in the dataset and may correct misspellings. This approach helps create consistency when terms are being counted and compared between any two documents. The computing process also determines tokens in each document. For example, each word in a document can be considered a token.
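By way of illustration only, case normalization and tokenization can be sketched as follows. Treating each run of letters or digits as a token is an assumption of this sketch, and spelling correction is omitted.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    # Lowercasing first makes "Search" and "search" count as the same token.
    return re.findall(r"[a-z0-9]+", text.lower())
```

Under this rule punctuation is discarded, so `tokenize("Searching Techniques, searching!")` returns `['searching', 'techniques', 'searching']`.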

The computing process should further remove tokens that are stopwords. For example, definite and indefinite articles or transitional phrases should be removed as they are less likely to be useful in determining similarity scores between documents.
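By way of illustration only, stopword removal can be sketched as a filter over the token list. The stopword list shown is a small hypothetical sample, not a list taken from the disclosure.

```python
# Hypothetical sample list; a practical list would be longer and language-specific.
STOPWORDS = frozenset({"a", "an", "the", "it", "of", "and", "or", "to", "in"})


def remove_stopwords(tokens, stopwords=STOPWORDS):
    """Drop tokens that appear in the stopword set, preserving order."""
    return [t for t in tokens if t not in stopwords]
```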

The computing process should also stem each token in the documents. Stemming each token may involve removing prefixes and suffixes from words to utilize them in the similarity calculation between two documents.
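By way of illustration only, a deliberately naive suffix-stripping stemmer is sketched below. The suffix list and the minimum stem length are assumptions of this sketch; it is not an established stemming algorithm and it does not handle prefixes.

```python
# Hypothetical suffix list, checked in order so longer endings are tried first.
SUFFIXES = ("ing", "ed", "es", "s")


def stem(token, min_stem=3):
    """Strip one common English suffix if enough of the word remains."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= min_stem:
            return token[: -len(suffix)]
    return token
```

With this rule "searching", "searched" and "searches" all reduce to "search", so the three forms are counted as a single token.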

The computing process may also transform phrases into individual tokens by, for example, taking a multiword phrase and making it into a single token in all documents.
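By way of illustration only, phrase merging can be sketched as follows, assuming the phrases of interest are already known and supplied as tuples of tokens. The function name and the underscore-joined form of the merged token are hypothetical.

```python
def merge_phrases(tokens, phrases):
    """Replace each known multiword phrase with a single joined token.

    `phrases` is a set of token tuples, e.g. {("search", "engine")}.
    """
    max_len = max((len(p) for p in phrases), default=1)
    merged = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate first so longer phrases take precedence.
        for n in range(min(max_len, len(tokens) - i), 1, -1):
            candidate = tuple(tokens[i:i + n])
            if candidate in phrases:
                merged.append("_".join(candidate))
                i += n
                break
        else:
            # No phrase starts at this position; keep the single token.
            merged.append(tokens[i])
            i += 1
    return merged
```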

The computing process may also associate each token with particular sections of the document. For example, a word that is used in the title may be weighted more heavily than the same word being used in the regular text of the document (e.g. the frequency count for that token may be transformed/increased to account for a token's use in the title).
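By way of illustration only, section-based weighting can be sketched as follows. The section names and the weight of 3.0 for the title are hypothetical values chosen for this sketch.

```python
from collections import Counter

# Hypothetical weights; the disclosure does not specify particular values.
SECTION_WEIGHTS = {"title": 3.0, "body": 1.0}


def weighted_counts(sections, weights=SECTION_WEIGHTS):
    """Count tokens, scaling each occurrence by the weight of its section.

    `sections` maps a section name to the list of tokens found in it.
    """
    counts = Counter()
    for name, tokens in sections.items():
        weight = weights.get(name, 1.0)  # unknown sections count once
        for token in tokens:
            counts[token] += weight
    return counts
```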

In preferred embodiments of the present invention, the computing process generates a frequency count of tokens for each document (i.e., a first frequency count). The computing process may transform the count of any given token based on the sections it is associated with and may also normalize the counts such that the length of the document is less relevant or not relevant. For example, ten occurrences of a word in a single-page document could be normalized to be equivalent to fifty occurrences of the same word in a five-page document.
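By way of illustration only, the first frequency count and one possible length normalization can be sketched as follows. Dividing each count by the total number of tokens in the document is an assumption of this sketch; other normalizations would serve the same purpose.

```python
from collections import Counter


def normalized_counts(tokens):
    """Return each token's share of the document's total token count."""
    counts = Counter(tokens)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {token: n / total for token, n in counts.items()}
```

Under this normalization a word occurring ten times in a 100-token document and fifty times in a 500-token document receives the same value, 0.1, in both.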

The computing process may also transform the token counts in ways deemedappropriate for the language or nature of the dataset. In certainlanguages, certain words may have more significance than other words inthat same language. Hence, certain words may be weighted more heavily inconducting a similarity calculation.

The computing process may also involve calculating other statistics that apply to each token. For example, a word's distance away from the front of a document may be calculated and used to transform the token counts, if desired.

In preferred embodiments of the present invention, the computing process inverts the data such that each token has a set of documents it resides in, along with the associated counts for each token and potentially other statistics obtained through the computing process.
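The inversion step can be pictured as turning a per-document table of token counts into a per-token table of documents. The sketch below assumes the counts are held in plain dictionaries; the function name and the document identifiers are hypothetical.

```python
from collections import defaultdict


def invert(doc_counts):
    """Map each token to the documents it resides in, with the count per document."""
    index = defaultdict(dict)
    for doc_id, counts in doc_counts.items():
        for token, count in counts.items():
            index[token][doc_id] = count
    return dict(index)


doc_counts = {
    "doc1": {"search": 3, "engine": 1},
    "doc2": {"search": 2},
}
print(invert(doc_counts))
```

Other per-token statistics could be stored alongside each count in the same structure.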

The computing process may, for each token in the set of unique tokens in the dataset, determine a numeric value that measures the magnitude of its significance in the dataset. For example, a word that occurs few times in the dataset may be deemed more important than a word that occurs many times in the entire dataset. Therefore, if one is searching a patent dataset such as the United States Patent and Trademark Office patent database for a word very commonly used in this type of dataset such as “claim” or “comprising” then these words will be given very little weight in the final calculation of the product of the significance magnitude and numeric frequency calculation (i.e. the similarity calculation) despite high numeric frequency scores in the documents. On the other hand, if a word is rarely used in a dataset (e.g. “interstellar” in the patent database) then this word will be weighted much more heavily in conducting the similarity calculation.

This step of the computing process related to determining a numeric value that measures the magnitude of each token's significance in the dataset is not used in transforming any of the data from the other steps. If such a transformation did occur, then whenever new documents were added to the dataset, all or many of the other steps (or subsets of these steps) of the computing process would need to be rerun. This type of duplicative processing would be expensive. Moreover, if a user wanted to alter the weighting assigned to the step of determining a numeric value that measures the magnitude of each token's significance in the dataset to determine how it affects the quality of the similarity scores, all or many of the steps would also need to be rerun.
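One possible way to compute such a value of significance is sketched below. The logarithmic formula is an assumption borrowed from conventional inverse-frequency weighting and is not stated in the disclosure; the only property relied on is that rarer tokens receive larger values. The values are kept in a separate table and the stored counts are left untouched, consistent with the separation described above.

```python
import math


def significance(index):
    """Return a separate token -> significance table; the index is not modified.

    Assumed formula: log(1 + total occurrences in dataset / occurrences of token).
    """
    totals = {token: sum(postings.values()) for token, postings in index.items()}
    grand_total = sum(totals.values())
    return {token: math.log(1.0 + grand_total / total) for token, total in totals.items()}


index = {
    "claim": {"p1": 40, "p2": 50},
    "interstellar": {"p1": 1},
}
sig = significance(index)
print(sig["interstellar"] > sig["claim"])
```

Because the table is derived from the index rather than written back into it, adding documents or changing the weighting only requires recomputing this one table.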

For any given document of text (or in some cases simply text typed in by a user or input from multiple documents), a similarity calculation determines for each other document in the dataset a numeric similarity score. The computing process to determine the similarity score involves, with the possible aid of the statistics calculated during the computing process, comparing each token's count in a designated document or text (i.e., an artifact or portion thereof that is being searched) to its matching token's count in each other document in the dataset. However, the computing process may not necessarily use a raw count of tokens in a document as a factor in the similarity score. In some embodiments of the present invention, it may be preferable to use a logarithmic transformation of the raw token count to decrease the magnitude rather than the raw number itself for the numeric frequency factor. For a given token, the magnitude of closeness of the two such token counts (or their derivatives) between two documents has a directly proportional contribution to the magnitude of the similarity score (i.e. the closer the token counts are for each token included in two compared documents, the more significant the contribution to improving the similarity score).

The computing process to determine the similarity score may further involve including an inversely proportional contribution to the magnitude of the similarity score for tokens that are in the designated document but not in another document in the dataset being compared to the designated document or for tokens that are not in the designated document but are in another document in the dataset being compared to the designated document. A token with a high token count in a first document that does not appear at all in a second document being compared to the first document will make a more significant contribution to reducing the magnitude of the similarity score than a token with a low token count in a first document that does not appear at all in a second document being compared to the first document. Moreover, the greater the number of tokens that appear in a first document and not in a second document being compared to the first document and vice versa, the more significant the contribution to reducing the magnitude of the similarity score.

Whenever the step of determining a numeric value that measures the magnitude of each token's significance in the dataset is utilized in the computing process, a given token's value of significance has a directly proportional contribution to the magnitude of the similarity score between two documents. In other words, if a particular token's value of significance is high, then the closeness of that particular token's count between documents is of increased importance in the similarity calculation between those documents. In a preferred embodiment of the present invention, this means that if a given token count is exactly the same in a designated document and a compared document, the higher the value of significance for that particular token, the more favorable impact that exact token match will have in the similarity calculation between the documents.

The similarity calculation should be applied in such a manner that a perfect similarity score between two documents can only be obtained if all of the token counts in the designated document match all of the token counts in the compared document and all of the token counts in the compared document match all of the token counts in the designated document and all such token values of significance for all tokens in the designated document and the compared document are equal to the maximum value in the entire set of values of significance (i.e. the maximum value of significance given to any token in the dataset is given to each token in the documents).
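The properties described above can be satisfied by many formulas. The sketch below is one hypothetical instance, not the formula of the disclosure: closeness is taken as the ratio of the smaller to the larger log-transformed count, each closeness value is scaled by the token's significance relative to the maximum significance, and a token present in only one document subtracts its log-transformed count. All names are assumptions.

```python
import math


def similarity(query, doc, sig, max_sig):
    """Score two token-count tables; 1.0 is reachable only for a perfect match."""
    tokens = set(query) | set(doc)
    if not tokens:
        return 0.0
    total = 0.0
    for token in tokens:
        q = math.log1p(query.get(token, 0.0))  # logarithmic transformation of the raw count
        d = math.log1p(doc.get(token, 0.0))
        if q > 0 and d > 0:
            closeness = min(q, d) / max(q, d)  # 1.0 when the counts are equal
            total += closeness * sig.get(token, 0.0) / max_sig
        else:
            total -= max(q, d)  # a higher unmatched count reduces the score more
    return total / len(tokens)


sig = {"a": 2.0, "b": 2.0}
print(similarity({"a": 3, "b": 1}, {"a": 3, "b": 1}, sig, max_sig=2.0))
```

Under this sketch a score of 1.0 requires every count to match in both directions and every token to carry the maximum value of significance, which mirrors the condition stated above.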

By conducting such a similarity calculation for all documents in a dataset, the top N most similar documents or least similar documents to a designated document or text can then easily be obtained. A given similarity score is consistently comparable to any other similarity score in the dataset, but it may not be comparable to a similarity score calculated by passing the designated document through some other entirely different dataset. Because the process defines that a similarity score is calculated for every document in the dataset, that total set of similarity scores can be used to normalize each of those similarity scores to something comparable across datasets. Given extremely normalized similarity scores (i.e., normalized similarity scores that are highly different from each other), a given designated document can yield a useful single set of similar documents derived from multiple datasets, by applying the condition that a given normalized similarity score is beyond some standard threshold. It is important to note that while a high similarity score may often be better based on the computing process, it can also be the case that a low or average score would produce the best match of similar or dissimilar documents in certain situations.
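As a rough illustration of ranking and cross-dataset normalization, the sketch below selects the top N documents and converts raw scores into standard scores (distance from the dataset mean in units of standard deviation). The use of standard scores and the threshold-based merge are assumptions; the disclosure does not name a normalization method.

```python
import statistics


def top_n(scores, n):
    """Return the n highest-scoring (document, score) pairs."""
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:n]


def normalize(scores):
    """Convert raw scores within one dataset into standard scores."""
    values = list(scores.values())
    mean = statistics.fmean(values)
    spread = statistics.pstdev(values)
    if spread == 0:
        return {doc: 0.0 for doc in scores}
    return {doc: (score - mean) / spread for doc, score in scores.items()}


def merge_above_threshold(normalized_by_dataset, threshold):
    """Collect documents from several datasets whose normalized score exceeds a threshold."""
    merged = []
    for dataset, scores in normalized_by_dataset.items():
        for doc, z in scores.items():
            if z > threshold:
                merged.append((dataset, doc, z))
    return sorted(merged, key=lambda row: row[2], reverse=True)


wikis = normalize({"w1": 0.9, "w2": 0.1, "w3": 0.5})
rfcs = normalize({"r1": 40.0, "r2": 10.0, "r3": 10.0})
print(merge_above_threshold({"wikis": wikis, "rfcs": rfcs}, threshold=1.0))
```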

To implement the above-described computing process for searching a source, also known as a dataset, for other documents related to a document based on the content in the document (i.e. the text of the document used as the search criteria), the set of instructions 13 may further include instructions for: parsing out the relevant text from any markup in a document and all documents in the source; lowercasing or uppercasing all letters in the text of a document and all documents in the source; correcting misspellings of words in a document and all documents in the source; determining tokens in a document and all documents in the source; removing tokens that are stopwords from the document and all documents in the source; stemming each token in the document and all documents in the source; transforming phrases into individual tokens in the document and all documents in the source; associating each token with a particular section in the document and all documents in the source; obtaining a frequency count of the tokens for the document and all documents in the source; transforming the count of any given token based on the sections it is associated with in the document and all documents in the source; normalizing the counts of the tokens for the document and all documents in the source; transforming the counts of the tokens in ways deemed appropriate for the language or nature of the dataset for the document and all documents in the source; calculating other statistics that apply to each token in the document and all documents in the source; inverting the data such that each token has a set of documents it resides in from the source, along with the associated counts and statistics; and determining a numeric value that measures the magnitude of each token's significance in the source. Some but not necessarily all of the computing instructions set out above may be used in the subject invention.

The set of instructions 13 further includes instructions for: comparing each token's count (or derivative) in a document to its matching token's count (or derivative) in another document in the source wherein the magnitude of closeness of the two counts has a directly proportional contribution to the magnitude of the similarity score between those documents. The set of instructions 13 may further include instructions for: determining which tokens are present in the document but not present in other documents and vice versa in the source and including an inversely proportional contribution to the magnitude of the similarity score between the document and another document based on the magnitude of each such token's count and the total number of such tokens; utilizing a token's value of significance to include a directly proportional contribution to the magnitude of the similarity score based on the closeness of a token's count between the document and each other document in the source; sorting the set of similarity scores from the source; and displaying the similarity scores from the source in ascending or descending order.

In preferred embodiments, indexing a document in a “flat” manner that counts words (i.e., tokens) rather than comparing them using proximity searches, Boolean logic and other logic algorithms is desired. The term “flat” as used herein shall mean that the search engine and indexing of a document does not attempt to understand the meaning of a word/token, its context vis-à-vis any other particular word or token or its proximity to other words/tokens. Indexing in a flat manner refers to only determining the overall words/tokens used within the dataset upfront and waiting to determine the similarity/relevance in real-time at the time of the search criteria based search, as opposed to prior art approaches that require creating a “deep” decision tree of all the possible combinations and relationships within a dataset prior to or exclusive of any subsequent searching based on search criteria. In this regard, embodiments of the present invention are not directed to how documents in a dataset compare to each other (i.e., deep search) but are directed to implementing searching in a manner that (a) compares the search criteria/seed to each individual document in the dataset and (b) overlays the intersection of any two document pairs against the dataset as a whole (i.e., a shallow search that only goes 2 levels deep). Searches implemented in accordance with preferred embodiments of the present invention do not attempt to compare a search seed to multiple documents, but rather do a 1-to-1 analysis with overall dataset overlay. A beneficial aspect of searching in this manner is that it allows one to search not just full documents but also URLs (i.e., in the context of the present invention, search criteria can be a URL).

A flat search refers to a search in which an algorithm counts both the number of times a unique word (token) is used in a dataset of multiple documents and also the number of times the same unique word is used in a search criteria document. The flat search then gives the individual word (token) a numeric value characterizing a magnitude of significance (i.e., significance magnitude factor) in the search criteria (i.e., document representing the search criteria) that is the product of the frequency of the use of the word in the search criteria document (this frequency could be logarithmic) and the significance of the word in the dataset (the significance score in the dataset is inversely proportional to frequency of use in the dataset). The frequency of the use of the word in the search criteria document is determined via a first frequency count that characterizes a number of times that each one of the tokens occurs within the text search criteria, and the significance of the word in the dataset is determined via a second frequency count that characterizes an aggregate number of times that a particular one of the tokens of the search criteria occurs within all of the documents in the dataset. As the significance magnitude factors for the unique words are aggregated, each document in the dataset receives an aggregated numeric similarity score (i.e., similarity score) in comparison to the search criteria document. It is important to note that flat searches as defined herein do not rely upon logic, proximity or any other attempts to understand the meaning of individual words.
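A flat search of this kind can be sketched in a few lines. The sketch assumes an inverted index of the form token -> {document: count}, takes the significance as the reciprocal of the second frequency count, and adds a token's factor to every document that contains the token. That last aggregation rule and all names are interpretive assumptions rather than details stated in the disclosure.

```python
import math
from collections import Counter


def flat_scores(query_tokens, index):
    """Aggregate per-token factors into one similarity score per document."""
    first_count = Counter(query_tokens)  # occurrences within the search criteria
    scores = {}
    for token, freq in first_count.items():
        postings = index.get(token, {})
        second_count = sum(postings.values())  # aggregate occurrences in the dataset
        if second_count == 0:
            continue
        significance = 1.0 / second_count  # inversely proportional to dataset frequency
        factor = math.log1p(freq) * significance  # logarithmic frequency times significance
        for doc_id in postings:
            scores[doc_id] = scores.get(doc_id, 0.0) + factor
    return scores


index = {
    "interstellar": {"d1": 2},
    "claim": {"d1": 30, "d2": 40},
}
print(flat_scores(["interstellar", "interstellar", "claim"], index))
```

No proximity, ordering or Boolean logic is consulted at any point; only counts are used.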

In order to speed up the searching of the dataset, it may be desirable (although it is not necessary because of the speed and efficiency of flat searches described herein) to screen out documents in the dataset by picking out the token or tokens with the highest similarity scores (e.g. numeric frequency factor times significance magnitude) in the search document (i.e. words that are used often in the search document and that are not used often in the dataset being searched) and only do similarity searches on documents in the dataset that have these highest ranking tokens. For example, if the word “interstellar” is used 100 times in a patent (i.e. the search document) and one is searching the US patent database (i.e. dataset) for similar patents and applications, it may be desirable to only look at the patents and applications that contain that word. This may dramatically increase the speed of the search of the dataset since one would be searching only hundreds or perhaps thousands of patents rather than the 8 million plus patents in the US patent database. From time to time this screening process may miss an important document but the increase in speed and the ability to dramatically cut back in computing power may make this screening process worthwhile.
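The screening step lends itself to a short sketch: given the per-token factors for the search document and an inverted index, only documents containing the k highest-ranking tokens are kept for the full similarity calculation. The function name, the parameter k and the sample identifiers are hypothetical.

```python
def screen_candidates(query_factors, index, k=1):
    """Return the documents that contain at least one of the k top-ranked tokens."""
    top_tokens = sorted(query_factors, key=query_factors.get, reverse=True)[:k]
    candidates = set()
    for token in top_tokens:
        candidates.update(index.get(token, {}))
    return candidates


# "interstellar" is frequent in the search document and rare in the dataset.
query_factors = {"interstellar": 4.6, "claim": 0.01}
index = {
    "interstellar": {"p7": 100, "p9": 3},
    "claim": {"p1": 12, "p7": 20, "p8": 15, "p9": 9},
}
print(sorted(screen_candidates(query_factors, index)))
```

Here documents p1 and p8 are never scored, which is the trade-off noted above: less computation at the risk of occasionally missing a relevant document.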

FIG. 2 depicts a screen shot of an exemplary sign-in webpage used in accessing a website to perform the preferred methodology for searching a source of electronically stored documents. In this example, a researcher can log into a designated website (e.g., a website referred to herein as “the Enlyton website”) using an e-mail address field 20 and password field 21 from any of a variety of different accounts. For example, a Google, Yahoo or Facebook account could be utilized for purposes of signing into the Enlyton website for conducting a research project utilizing the Google tab 22, the Facebook tab 23 or the Yahoo tab 24.

FIG. 3 depicts an interactive webpage for use in setting up the search environment in a preferred embodiment of the present invention. After signing into the Enlyton website, an interactive webpage 33 is displayed. At the top of the webpage, several icons and links are displayed. One sign-in icon 30 permits a researcher to sign in from a different account (e.g. Yahoo, Google or Facebook). A disk icon 31 permits a user to save a current project by clicking on the disk icon 31 and following the instructions. Alternatively, a new project tab 32 allows a user to create a different project by clicking on it.

The interactive webpage 33 also permits the researcher to select the proper research environment for a search. For example, various categories of documents and related icons are shown on the left side of the webpage 33. These categories include: Intellectual Property 34, Technology 35, Market 36, Finance 37, Health 38, Law 39 and All Datasources 40. Obviously, the categories listed are merely illustrative and other categories of documents may also be created. The categories ideally have several different sources associated with each category, which are available for searching. The All Datasources 40 is an all-inclusive category wherein a researcher can select from all available sources for searching.

For example, under Intellectual Property 34, a researcher can select whether to search the Request for Comments (RFCs) 41, Wikipedia articles (Wikis) 42, the United States Patent and Trademark Office patent database (patents) 43, the Institute of Electrical and Electronic Engineers (IEEE) articles 44 or whatever other sources are available for searching. In this embodiment, a user simply clicks in the box icon associated with any or all of these sources to create a check mark inside the associated box.

After selecting the appropriate research environment, all selected sources will appear at the top of the webpage 33 in tabs. In the example depicted in FIG. 3, the researcher has only selected Wikis 42 and RFCs 41 under the Intellectual Property category. Hence, only the Wikis tab 45 and RFCs tab 46 appear at the top of the webpage as searchable data sources along with the Web tab 47 which allows a researcher to conduct an Internet search. A plus symbol tab 48 which allows a researcher to change the research environment to add other sources at any time also appears at the top of the webpage 33. If a researcher clicks on the plus symbol tab 48, he can change the research environment and click on the apply changes tab 49 to add or subtract sources. As can also be seen in FIG. 3, the corresponding sources are also checked in the All Datasources 40 category when Wikis 42 and RFCs 41 are selected under the Intellectual Property category 34.

After the proper research environment is created, a researcher then clicks on the desired source tabs to conduct a search specific to that source. For example, a researcher could click on the Wikis tab 45 to search the indexed Wikipedia articles related to whatever search terms the researcher inputs.

FIG. 4 depicts an interactive webpage for use in conducting a search associated with a preferred embodiment of the present invention. A researcher may enter desired search terms into the natural language search box 61 shown on the interactive webpage 60. Preferably, the researcher will utilize many relevant search terms or cut and paste portions of documents or entire documents into the natural language search box 61 to create the search terms. The computing process of a preferred embodiment of the present invention described in conjunction with FIG. 1 allows the entire text inserted as search terms/criteria to be utilized in conducting a search for related documents in a given source. Because the use of the entire text of the documents is often the best search criteria, the researcher is encouraged to submit as much text as possible. This use of large amounts of text or entire documents either cannot be entered into typical search engines (e.g. Google, Bing, Yahoo) or causes them to give error messages or to break down. If a researcher desires to emphasize the importance of certain search terms, he can insert emphasis (such as three asterisks) next to certain language to accentuate this language during the searching methodology. This additional emphasis will be utilized in the computing process to increase the designated tokens' value of significance by either increasing the magnitude of the significance factor or increasing the numerical frequency factor (or a combination of both).
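One hypothetical reading of the emphasis mechanism is sketched below: a word immediately followed by three asterisks has its frequency contribution multiplied by a boost. The placement of the asterisks directly after the word, the boost of 5.0 and the function name are assumptions made only for this example; the alternative of raising the significance factor instead is not shown.

```python
import re
from collections import Counter

EMPHASIS_BOOST = 5.0  # hypothetical multiplier for emphasized words


def emphasized_counts(text, boost=EMPHASIS_BOOST):
    """Count words, weighting any word written as word*** more heavily."""
    counts = Counter()
    for word, stars in re.findall(r"([a-z0-9]+)(\*{3})?", text.lower()):
        counts[word] += boost if stars else 1.0
    return dict(counts)


print(emphasized_counts("Google toolbar*** search toolbar"))
```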

In preferred embodiments of the present invention, the text inserted as the search criteria (e.g. full document) to be utilized in conducting the search for related documents is a URL itself. A uniform resource locator, abbreviated URL, is also sometimes known as a web address. It is a specific character string that constitutes a reference to a resource. In web browsers, the URL of a web page is typically displayed on top inside an address bar. An example of a typical URL would be “http://www.enlyton.com/”.

Traditional search engines (e.g. Google, Yahoo, Bing and so forth) cannot search URLs themselves for various reasons (e.g. too many words to search). However, in the case of a discovery engine configured in accordance with an embodiment of the present invention, it is possible to use a URL as the search criteria for a full document search. The URL of interest can be automatically curated (e.g., transformed) into the form of an XML (Extensible Markup Language) file or other suitable text-based file format (e.g., JavaScript Object Notation (JSON) format) that can be linked or associated with the URL.

While HyperText Markup Language (HTML) is the main markup language for creating web pages and other information that can be displayed in a web browser, the discovery engine disclosed herein can transform the HTML file or other file format of URL content (which tells the browser how to display the underlying content) and convert it into an XML file or other suitable text-based file format which is a preferred way to structure, store and transport content. This is important because XML data is stored in text format. XML makes it easier to expand or upgrade to new operating systems, new applications, or new browsers, without losing data. This is crucial because one of the most time-consuming challenges for developers is to exchange data between incompatible systems over the Internet and exchanging data as XML greatly reduces this complexity, since the data can be read by different incompatible applications. In this regard, XML file format and other suitable text-based file formats are preferred formats for search criteria generated from URL content. Additionally, it is disclosed herein that, although HTML is a primary type of file format for URL content, search criteria used in association with embodiments of the present invention can be URL content in a file format/markup language other than HTML (e.g., an open source file format such as TXT, SDF, XML; a proprietary file format such as PDF, PPT, WORD; etc.).

Even when using XML, it may be desirable to screen out the other non-markup superfluous content from the basic content of the URL. This superfluous content is sometimes referred to as “chrome” and it can include text in the URL that is unimportant to the core content of the URL. For a URL to be used as the search criteria, the computing process should parse out the relevant content from any markup. For example, if the markup language in a document contains <title> Searching Techniques </title>, the relevant text “Searching Techniques” is parsed out and the <title> and </title> tags in the markup language are removed. Generally, strings of Unicode characters that constitute markup either begin with the character < and end with a >, or they begin with the character & and end with a ;. Strings of Unicode characters that are not markup are typically content but this is not always the case. In addition to markup language, other types of chrome that may need to be screened out of the URL include advertising, pictures, graphics and so forth. The term “chrome” as used herein can mean any text, markup, pictures, graphics, etc. that are not XML text important to the understanding of a document.
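A minimal sketch of parsing the relevant text out of markup is given below, using the HTMLParser class from the Python standard library. Tags are dropped, character references beginning with & are decoded, and the contents of script and style elements are skipped as one simple kind of chrome. The class name, the function name and the choice of which elements to skip are assumptions; screening out advertising or navigation text would need additional rules not shown here.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text content while ignoring tags and script/style bodies."""

    SKIP = {"script", "style"}  # hypothetical list of elements treated as chrome

    def __init__(self):
        super().__init__(convert_charrefs=True)  # decode references such as &amp;
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def extract_text(markup):
    """Return the text of a markup string with tags removed."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    return " ".join(parser.parts)


print(extract_text("<title> Searching Techniques </title><p>Tom &amp; Jerry</p>"))
```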

The natural language search box 61 allows a researcher to input text and then click on the magnifying glass icon 62 to conduct the search and return a list 63 of related documents. The researcher may also clear the dialog box by clicking on the eraser icon 64. In this example, “Google Toolbar” has been inserted into the natural language search box 61 and a search related to these search terms has been conducted in the Wikis source. The results are displayed on the left side of the page in a list 63 of related documents and a condensed view 72 of a selected related document 65 is shown on the right side of the page. A user can click on the Full View icon 66 to see the full view of the related document displayed. In this case the selected related document 65 is the Wikipedia entry/webpage for “Google Toolbar.”

When the list 63 of related documents appears after a search is conducted, the researcher then has the ability to select the star icon 67 associated with each document retrieved from the search to add the document to a collection. After at least one document has been added to the collection for a given research project, a Collection tab 68 will appear at the top of the webpage 60 and can be clicked on to view any and all collected documents. A researcher can also click on the paper icon 69 to add a comment specific to any related document found in the search. The comment will appear in the collection in a comment box specific to the related document.

If a researcher determines that a specific related document is extremely relevant, he can simply click on any of the source links also listed next to that reference to conduct a search for documents contained in that source. This search is referred to herein as a “more like this” or “MLT” search. The MLT search is then conducted utilizing the text of the extremely relevant related document as the search criteria to find related documents in the other source based on the previously described computing process.

For example, if the user determines that the “AOL Toolbar” Wikipedia entry is extremely relevant, he may click on the RFCs icon 71 next to the Wikis icon 70 under the “AOL Toolbar” entry. This causes a search automatically to be performed to find related content in the RFC source that relates to the content contained in the “AOL Toolbar” Wikipedia entry. The previously described computing process utilizes the text of the “AOL Toolbar” Wikipedia webpage/entry as the search criteria in performing a search to uncover related documents in the RFC source.

FIG. 5 depicts another interactive webpage for use in conducting a search associated with a preferred embodiment of the present invention. The interactive webpage 80 shows the results from conducting a search in the RFC source based on the text/entire content of the Wikipedia webpage for “AOL Toolbar.” A list 81 of RFC webpages/documents that contain related information to the “AOL Toolbar” Wikipedia webpage is displayed on the left side of the screen. A condensed view of the first RFC entry from that list is displayed on the right side of the interactive webpage 80.

A researcher also may continue his search by putting search criteria/terms into the natural language search box 82 for another source and continue to add documents to the collection. If the researcher wants to add or subtract the sources to be searched, he can click on the plus symbol icon 83 to alter the different sources that appear on the interactive webpage 80. If a researcher chooses to add a link to the collection of documents, he can simply select the collection tab 84.

FIG. 6 depicts another interactive webpage for use in conducting a search associated with a preferred embodiment of the present invention. If a user selects the collection tab, an interactive webpage 90 will appear. The interactive webpage 90 allows a user to type a link into the dialog box 91 and select the additional link tab 92 if the user chooses to add a specific link to his collection. Alternatively, the researcher could simply select the export collection link 93 to permit the collection to be shared with others. The entire collection can then be sent via single URL to another individual who could then view the contents of all documents in the collection by selecting the URL.

FIG. 7 shows a webpage displaying a collection. When a user selects the URL containing the collection, each individual document/webpage in the collection will be displayed with a table associated therewith. In FIG. 7, table 100 is associated with the Wikipedia webpage for “Google Toolbar” and table 101 is associated with the Wikipedia webpage for “AOL Toolbar.” Table 100 has a Location field 102, Comments field 103 and Title field 104. Likewise, table 101 has a Location field 105, Comments field 106 and Title field 107. The Location field shows the URL associated with each collected document. The Comments field displays any comments entered by the user related to the collected document, and the Title field gives the title of each document in the collection. In some embodiments of the present invention, the entire text of the document/webpage will be displayed beneath the table for each document/webpage.

FIG. 8 shows a webpage displaying an electronic magazine (electronic document collection) that can be stored and shared via a single URL. The URL 110 is listed at the top of the webpage 111. The various sources 112 searched are also shown at the top of the webpage 111. The electronic magazine is a data collection created by utilizing the computing process of the present invention. The electronic magazine contains webpages/documents found via searches of the sources 112 listed at the top of the webpage 111. The search criteria or search terms involve the entire document/webpage shown first in the list. In FIG. 8, all of the text from a webpage entitled “Facebook's Navigation Bar Becomes Omnipresent” contained in the Mashable.com source served as the search criteria/terms and all sources 112 were searched using this search criterion to create the electronic magazine with related documents 113 from each source 112 displayed. In FIG. 8, forty results were returned but only some are displayed. A user can click on the arrow 114 on the right side to view more results.

The URL 110 associated with the electronic magazine is completely portable. Anyone that clicks on the URL 110 will be directed to the electronic magazine (collection). The content in the electronic magazine is unique and updated to deliver new content or sources because each time it is opened, the search is conducted based on the current content of the webpage being used as the search criteria and any new documents available in any of the sources may be added to the electronic magazine each time it is opened. A publisher can simply add a link on its webpage that permits an electronic magazine to be created based on the content contained in the current webpage as the search criteria. Depending on default conditions or user specifications, the sources searched may be limited or may be anything on the World Wide Web.

The URLs associated with the electronic magazine are portable and can be shared with anyone across any social or messaging platform. The content related to the original page being searched in the electronic magazine is always updated so the electronic magazine is always fresh and not static. No active participation by the user (rating, identifying, reviewing, ranking, etc.) is required in creating and viewing the electronic magazine.

FIG. 9 depicts the preferred methodology of searching, collecting, storing and sharing electronic documents. The preferred methodology may include the steps of: allowing a user to sign into an account using a variety of other website accounts 120; selecting at least one category of sources 121; selecting at least one source 122 (i.e. a dataset/collection of documents) within at least one category of sources; utilizing search terms to search the at least one source 123; returning related documents from the at least one source based on the search terms 124; collecting any of the related documents into a collection 125; permitting a related document returned to be selected for a further search utilizing the text of the related document as the search terms/criteria in a selected source to return additional related documents 126; and creating a URL with all of the collected related documents stored at a location referenced in the URL 127.

The step of permitting the related document returned to be selected for a further search utilizing the text of the related document as the search terms/criteria in a selected source to return additional related documents may involve: parsing out the relevant text from any markup in the related document and all documents in the selected source; lowercasing or uppercasing all letters in the text of the related document and all documents in the source; correcting misspellings of words in the related document and all documents in the source; determining tokens in the related document and all documents in the source; removing tokens that are stopwords in the related document and all documents in the source; stemming each token in the related document and all documents in the source; transforming phrases into individual tokens in the related document and all documents in the source; associating each token with particular sections of the related document and all documents in the source; obtaining a frequency count of the tokens in the related document and all documents in the source; transforming the count of any given token based on the sections it is associated with in the related document and all documents in the source; normalizing the counts of the tokens between the related document and all documents in the source; transforming the counts of the tokens in ways deemed appropriate for the language or nature of the dataset for the related document and all documents in the source; calculating other statistics that apply to each token in the related document and all documents in the source; inverting the data such that each token has a set of documents it resides in, along with the associated counts and statistics; and determining a numeric value that measures the magnitude of each token's significance in the source. Some but not necessarily all of the computing instructions set out above may be used in one or more embodiments of the present invention.
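By way of non-limiting illustration, a few of the steps listed above (markup removal, lowercasing, stopword removal, stemming, frequency counting, inversion of the data and a per-token significance value) might be sketched as follows. The function names, the tiny stopword list and the suffix-stripping stemmer are hypothetical simplifications chosen for the sketch, and the significance value is assumed to be the number of documents divided by the aggregate token count; none of these details is prescribed by the disclosure.

```python
import re
from collections import Counter, defaultdict

# Illustrative stopword list only; a real deployment would use a fuller list.
STOPWORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "for"}


def strip_markup(text):
    """Remove anything that looks like an HTML/XML tag."""
    return re.sub(r"<[^>]+>", " ", text)


def toy_stem(token):
    """Crude suffix stripper standing in for a real stemmer (assumption)."""
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def tokenize(text):
    """Strip markup, lowercase, split into tokens, drop stopwords, stem."""
    words = re.findall(r"[a-z0-9]+", strip_markup(text).lower())
    return [toy_stem(w) for w in words if w not in STOPWORDS]


def build_index(documents):
    """documents: dict of doc_id -> raw text.

    Returns (inverted_index, significance), where inverted_index maps each
    token to {doc_id: count} and significance maps each token to
    (number of documents) / (aggregate count of the token in the source).
    """
    index = defaultdict(dict)
    for doc_id, text in documents.items():
        for token, count in Counter(tokenize(text)).items():
            index[token][doc_id] = count
    n_docs = len(documents)
    significance = {
        token: n_docs / sum(postings.values())
        for token, postings in index.items()
    }
    return dict(index), significance


docs = {
    "d1": "<p>The collector collects interstellar light</p>",
    "d2": "Light sensors and light filters",
}
index, significance = build_index(docs)
```

In this sketch a token that appears often across the source (here "light") receives a lower significance value than a token that appears rarely (here "interstellar").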

The step of permitting the related document returned to be selected for a further search utilizing the text of the related document as the search terms/criteria in a selected source to return additional related documents may further involve: comparing each token's count in the related document to its matching token's count in all other documents in the source wherein the magnitude of closeness of the two counts has a directly proportional contribution to the magnitude of the similarity score between the related document and any other given document in the source; determining which tokens are present in the related document but not present in other documents and vice versa in the source and including an inversely proportional contribution to the magnitude of the similarity score between the related document and another document based on the magnitude of each such token's count and the total number of such tokens; utilizing a token's value of significance to include a directly proportional contribution to the magnitude of the similarity score based on the closeness of a token's count between the related document and each other document in the source; sorting the set of similarity scores from the source; and displaying the similarity scores from the source in ascending or descending order. The searching technology discussed herein may not require the use of an extreme amount of computer resources because the indexing of documents and the searching of documents is done in a flat manner which does not involve proximity searching and/or logic searching (e.g. trying to understand the meaning of words by looking for associations with other words).
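By way of non-limiting illustration, one possible reading of the scoring steps above is sketched below. The closeness measure (smaller count divided by larger count), the flat `mismatch_penalty` and the function names are assumptions made for the sketch; the disclosure does not specify particular formulas.

```python
def similarity(query_counts, doc_counts, significance, mismatch_penalty=0.1):
    """Score one document against the search criteria (hypothetical formula).

    Shared tokens add to the score in proportion to how close their counts
    are, weighted by the token's significance. Tokens found in only one of
    the two documents subtract from the score in proportion to their count.
    """
    score = 0.0
    for token in set(query_counts) | set(doc_counts):
        q = query_counts.get(token, 0)
        d = doc_counts.get(token, 0)
        if q and d:
            closeness = min(q, d) / max(q, d)  # 1.0 when counts are equal
            score += significance.get(token, 1.0) * closeness
        else:
            score -= mismatch_penalty * (q + d)
    return score


def rank(query_counts, documents, significance, descending=True):
    """Return (doc_id, score) pairs sorted by similarity score."""
    scored = [
        (doc_id, similarity(query_counts, counts, significance))
        for doc_id, counts in documents.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=descending)


query = {"light": 2, "interstellar": 3}
source = {
    "a": {"light": 2, "interstellar": 3},
    "b": {"light": 2, "sensor": 4},
}
weights = {"light": 0.5, "interstellar": 2.0, "sensor": 1.0}
ranked = rank(query, source, weights)
```

Because each document receives a single number computed from token counts alone, no proximity or logic analysis is needed, which is consistent with the flat searching described above.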

While it will quite often be the case that a researcher will wish to conduct a search utilizing the entire content/text of a document as the search criteria, it is also possible in certain instances that only a portion of text from a document or text from multiple documents or text input by a user will be used as the search criteria. In such a situation, the computing process will simply utilize such text in creating tokens and comparing such tokens to the documents in a source.

The step of collecting any of the related documents into a collection may involve identifying the related documents to be collected from each source. The step of collecting documents may also be performed to collect additional related documents after any search, including after a further search is performed utilizing the entire text of the at least one related document.

The preferred methodology may further involve sharing the relevant documents by sending the URL to select other users via any electronic method, including social networking websites and electronic mail services 128.

FIG. 10 depicts the preferred methodology of creating an electronic collection of related documents from a variety of sources based on the entire text of a single electronic document. The methodology may include: utilizing a document such as a webpage to automatically conduct a search of designated sources for documents related to the document based on the entire text contained in the document 140; displaying the related documents found in the search in a collection 141; and storing the collection under a single URL that can be utilized to display the collection 142. A collection of related documents can be automatically curated into the form of an XML (Extensible Markup Language) file that can be linked or associated with the URL. The collection of related documents can be dynamically updated by periodically and, in some cases, automatically re-running searches based on the document/webpage and using the preferred searching methodology to include newly added documents in the designated sources.
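By way of non-limiting illustration, curating a collection into an XML file might be sketched as follows using the Python standard library. The element names ("collection", "document"), the attribute names and the example URLs are hypothetical; the disclosure does not define a schema.

```python
import xml.etree.ElementTree as ET


def collection_to_xml(source_url, related_documents):
    """Serialize a collection of related documents to an XML string.

    related_documents is a list of dicts with "url", "title" and "score"
    keys (an assumed structure for this sketch).
    """
    root = ET.Element("collection", {"source": source_url})
    for doc in related_documents:
        item = ET.SubElement(
            root,
            "document",
            {"url": doc["url"], "score": f"{doc['score']:.3f}"},
        )
        item.text = doc["title"]
    return ET.tostring(root, encoding="unicode")


xml_text = collection_to_xml(
    "https://example.com/article",
    [
        {"url": "https://example.com/a", "title": "First result", "score": 0.91},
        {"url": "https://example.com/b", "title": "Second result", "score": 0.75},
    ],
)
```

Re-running the search and regenerating this file each time the URL is opened would be one way to keep the collection current.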

The step of utilizing a document such as a webpage to automatically conduct a search of designated sources for documents related to the document based on the entire text contained in the document involves: parsing out the relevant text from any markup in the document and all documents in the designated sources; lowercasing or uppercasing all letters in the text of the document and all documents in the designated sources; correcting misspellings of words in the document and all documents in the designated sources; determining tokens in the document and all documents in the designated sources; removing tokens that are stopwords in the document and all documents in the designated sources; stemming each token in the document and all documents in the designated sources; transforming phrases into individual tokens in the document and all documents in the designated sources; associating each token with particular sections of the document and all documents in the designated sources; obtaining a frequency count of the tokens in the document and all documents in the designated sources; transforming the count of any given token based on the sections it is associated with in the document and all documents in the designated sources; normalizing the counts of the tokens between the document and all documents in the designated sources; transforming the counts of the tokens in ways deemed appropriate for the language or nature of the dataset for the document and all documents in the designated sources; calculating other statistics that apply to each token in the document and all documents in the designated sources; inverting the data such that each token has a set of documents it resides in, along with the associated counts and statistics; and determining a numeric value (value of significance) that measures the magnitude of each token's significance in the source.

The step of utilizing a document such as a webpage to automatically conduct a search of designated sources for documents related to the document based on the entire text contained in the document may further involve: comparing each token's count in the document to its matching token's count in all other documents in the designated sources wherein the magnitude of closeness of the two counts has a directly proportional contribution to the magnitude of the similarity score between the document and any other given document in the designated sources; determining which tokens are present in the document but not present in other documents and vice versa in the designated sources and including an inversely proportional contribution to the magnitude of the similarity score between the document and another document from the designated sources based on the magnitude of each such token's count and the total number of such tokens; utilizing a token's value of significance to include a directly proportional contribution to the magnitude of the similarity score based on the closeness of a token's count between the document and each other document in the designated sources; sorting the set of similarity scores from the designated sources; and displaying the similarity scores from the designated sources in ascending or descending order.

A default setting or user selected setting could be utilized to create a threshold value for a similarity score that must be achieved for a document to be included in the collection, or to set the maximum/minimum number of documents that can be included in the collection.

In some embodiments of the present invention, it may be possible to screen documents in a dataset most efficiently by picking out the token or tokens with the highest similarity scores (e.g. numeric frequency factor times significance magnitude) in the search document (i.e. words that are used often in the search document and that are not used often in the dataset being searched) and only do similarity searches on documents in the dataset that have these highest ranking tokens. For example, if the word “interstellar” is used 100 times in a patent (i.e. the search document) and one is searching the US patent database (i.e. dataset) for similar patents and applications, it may be desirable to only look at the patents and applications that contain that particular word (i.e. “interstellar”) or that particular word combined (not in any particular proximity) with other high ranking words/tokens. This dramatically increases the speed of the search since one would be searching only hundreds or perhaps thousands of patents rather than the 8 million plus patents in the US patent database.
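By way of non-limiting illustration, the screening described above might be sketched as follows, assuming an inverted index that maps each token to the documents containing it. The function name, the `top_k` parameter and the sample data are hypothetical.

```python
def candidate_documents(query_counts, significance, inverted_index, top_k=1):
    """Return ids of documents containing any of the top_k ranked tokens.

    Tokens of the search document are ranked by (count in the search
    document) times (significance in the dataset), so that words used often
    in the search document but rarely in the dataset rank highest.
    """
    ranked_tokens = sorted(
        query_counts,
        key=lambda t: query_counts[t] * significance.get(t, 0.0),
        reverse=True,
    )
    candidates = set()
    for token in ranked_tokens[:top_k]:
        candidates.update(inverted_index.get(token, {}))
    return candidates


query_counts = {"interstellar": 100, "method": 150}
significance = {"interstellar": 1000.0, "method": 1.5}
inverted_index = {
    "interstellar": {"p1": 3, "p7": 1},
    "method": {"p1": 5, "p2": 9, "p3": 2, "p7": 4},
}
candidates = candidate_documents(query_counts, significance, inverted_index)
```

Only the documents returned by this step would then be scored for similarity, which reduces the number of comparisons at the cost of possibly missing related documents that lack the selected tokens.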

However, screening of documents by only looking at a subset of the documents in a dataset based on some key words/tokens is not necessary in preferred embodiments of the present invention because the flat searching and indexing allows for similarity searches of all the documents in a dataset with a single aggregate similarity score calculated for each document in the dataset vis-à-vis the search criteria document.

In at least one embodiment of the discovery engine disclosed herein, similarity searches may be made across different datasets. This can be a difficult similarity search problem because different datasets contain different styles of language, which can skew the similarity results. As an example, patent databases are full of patents that use arcane legal language, which is not always consistent with how a layman would speak (e.g. the continual use of terms such as “comprising”, “methods”, “apparatus” and so forth). The use of the value of significance factor will help remedy this problem to a certain extent since these arcane terms in the patent dataset are weighted less heavily than they would be weighted in other datasets. However, a style of writing in a dataset can have an effect on relative similarity scores even after the significance value has been applied.

For example, one could imagine a Wikipedia document whose content is closely related to a patent being used as the search criteria document. However, one could also imagine that the Wikipedia document has a lower gross similarity score than certain patent documents that are not as closely related from a pure content perspective. This could happen because the similar writing styles used by patent attorneys will necessarily result in higher similarity scores for other documents written in the same arcane legal style (i.e. other patents in the patent database). This bias toward documents written in a certain style can have the result of “hiding” or screening out more closely related documents from other datasets written in different styles because of a lower gross similarity score.

Embodiments of the present invention can address this “dataset bias” problem by normalizing the similarity results. FIG. 11 discloses a method 200 configured in accordance with an embodiment of the present invention for normalizing the similarity results of documents within a dataset. It is disclosed herein that a device configured in accordance with an embodiment of the present invention can be configured to perform the method 200 (e.g., a set of instructions of the device 10 being configured to perform the method 200).

The method 200 begins with an operation 202 being performed for indexing all documents in each one of a plurality of datasets in accordance with an embodiment of the present invention. The documents in a particular dataset are flatly indexed using only a word count frequency factor for each word in each document (i.e., via the first frequency count) and a significance factor for each word based on the aggregate frequency that each particular word is used in the entire dataset (i.e., via the second frequency count). As discussed previously, the more often a word appears in a dataset, the lower its significance factor. Also as discussed above, the frequency factor can be lowered by a logarithmic function. The result of such indexing of the documents of a dataset is that each word (e.g., token) present in each document in the dataset has a numerical value that is the product of a frequency factor and a significance factor.

An operation 204 is performed for designating search criteria. Designating the search criteria can include a user (e.g., of the device 10) choosing a document or a URL as the search criteria for use in searching for similar documents in each one of the datasets. After designating the search criteria, an operation 206 is performed for conducting a similarity search on each word in the search criteria document in relation to (i.e., vis-à-vis) each word in each document in the dataset to determine a word-by-word significance magnitude factor for each document. The significance magnitude factor is a type of similarity score on a per-word (e.g., per-token) basis. Flat searching is performed in the same manner as disclosed above in reference to the datasets of the operation 202. An operation 208 is then performed for aggregating the word-by-word significance magnitude factors to generate an aggregate similarity score. Aggregating the word-by-word similarity scores to generate the aggregate similarity score includes aggregating (e.g., combining) the word-by-word similarity score for every word in each of the documents of a dataset, thereby determining a single similarity score for each document in a dataset vis-à-vis the search criteria document.

In order to normalize the similarity scores between different datasets, it is necessary to go beyond the single (i.e., gross) similarity score of each document. What is needed is a normalized comparison of similarities within each dataset. To this end, an operation 210 is performed for determining an arithmetic mean of the similarity scores for all of the documents in each one of the datasets. Determining an arithmetic mean of the similarity scores for all of the documents in the dataset can include calculating the mean (e.g., average) similarity score vis-à-vis the search criteria for the entire dataset. For example, the mean similarity score can be calculated by adding up all the similarity scores for all the documents in the dataset and then dividing this total number by the number of documents in the dataset. It is also possible to use other statistical averages represented by bell curves. After the mean similarity score (or some other statistical average) for each dataset is known, an operation 212 is performed for generating a dataset normalized similarity score for each document of the dataset dependent upon the arithmetic mean of the similarity scores for all of the documents in the dataset. Generating a dataset normalized similarity score for each document of the dataset can include determining a variance factor (e.g., the variance or a deviation such as the standard deviation) for each document relative to the arithmetic mean of the similarity scores for all of the documents in the dataset and then multiplying the variance factor for a particular document by the gross similarity score of the particular document, thereby generating a normalized similarity score that has a contextual relationship to the dataset. This normalized similarity score allows for comparison of documents between datasets with varying styles of writing and organization.

For example, a patent document when compared to a search criteria document might have a similarity score of X, which is twice as high as the gross similarity score for a particular Wikipedia article at 0.5X. However, it may be that the other patents in the dataset also have similarity scores not considerably different from X so that the deviation from the mean is fairly low (e.g. 1.2 times). On the other hand, if the mean Wikipedia similarity score is very low, the deviation from the mean might be a factor of 3. In this case the normalized Wikipedia document score (3 times 0.5X=1.5X) is higher than the normalized patent document score (1.2 times X=1.2X) and the Wikipedia article would rank higher in normalized similarity than the patent document.
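By way of non-limiting illustration, the normalization can be sketched as follows. The sketch assumes that the variance factor is the ratio of a document's gross similarity score to the mean score of its dataset, which is the reading suggested by the factors of 1.2 and 3 in the example; other variance or deviation measures could be substituted. The function name and the sample scores (with X taken as 1.0) are hypothetical.

```python
def normalize_scores(gross_scores):
    """Map doc_id -> normalized score for one dataset.

    normalized = (gross score / dataset mean) * gross score, where the
    ratio to the mean plays the role of the variance factor (assumption).
    """
    if not gross_scores:
        return {}
    mean = sum(gross_scores.values()) / len(gross_scores)
    if mean == 0:
        return {doc_id: 0.0 for doc_id in gross_scores}
    return {
        doc_id: (score / mean) * score
        for doc_id, score in gross_scores.items()
    }


# Gross scores with X = 1.0: the patent mean is about 0.833 (factor 1.2 for
# patent_A) and the Wikipedia mean is about 0.167 (factor 3 for wiki_A).
patents = {"patent_A": 1.0, "patent_B": 0.8, "patent_C": 0.7}
wikipedia = {
    "wiki_A": 0.5, "wiki_B": 0.1, "wiki_C": 0.1,
    "wiki_D": 0.1, "wiki_E": 0.1, "wiki_F": 0.1,
}

combined = {**normalize_scores(patents), **normalize_scores(wikipedia)}
ranking = sorted(combined, key=combined.get, reverse=True)
```

With these sample scores, wiki_A receives a normalized score of about 1.5 and patent_A about 1.2, so the Wikipedia article ranks first even though its gross score is half that of the patent.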

This deviation from average or mean similarity scores in a dataset is an efficient and effective way of normalizing results between datasets. Accordingly, an operation 214 is performed for determining relevance of documents of different datasets dependent upon the normalized similarity score of each document as opposed to the non-normalized similarity score of each document.

A person of skill in the art would readily recognize that steps of the various above-described methods can be performed by programmed computers and the order of the steps is not necessarily critical. Herein, some embodiments of the present invention are intended to cover program storage devices, e.g., digital data storage media, which are machine or computer readable and encode machine-executable or computer executable programs of instructions where said instructions perform some or all of the steps of methods described herein. The program storage devices may be, e.g., digital memories, magnetic storage media such as magnetic disks or tapes, hard drives, or optically readable digital data storage media. Some embodiments of the present invention are also intended to cover computers programmed to perform said steps of methods described herein.

In view of the disclosures made herein, a skilled person will appreciate that a discovery engine configured in accordance with an embodiment of the present invention can be implemented in a manner that enables functionality as depicted in the following example.

Example: Discovery Engine Functionality

Using a discovery engine configured in accordance with the disclosure made herein (i.e., an inventive search engine) or a prior art search method, a researcher has found a patent of interest entitled “Interstellar Light Collector” (i.e., U.S. Pat. No. 7,338,148, which is hereinafter referred to as the '148 patent). The researcher is interested in finding other US patents similar to the '148 patent. Using the entire text of the '148 patent as the search criteria within the inventive search engine, the researcher designates (e.g., chooses through selection from a plurality of selection options) the US patent database as the dataset to be searched using the '148 patent text as the search criteria.

The inventive search engine performs flat searching of documents within the dataset (i.e., the US patent database) with respect to the search criteria (i.e., the entire text of the '148 patent). The search criteria are processed for identifying tokens therein. As illustrated below for an exemplary token (e.g., word) “interstellar”, each token of the '148 patent is processed in accordance with the present invention for enabling calculation of a similarity score.

In the '148 patent, the token “interstellar” is found 8 times (i.e., the first frequency count for the token “interstellar”). Rather than using the raw number 8 as a token frequency multiplier, the raw number can be lowered by applying a dampening function (e.g., a logarithmic function) to the raw number. In this example, the square root of 8 is utilized (i.e., frequency multiplier=√8). To find a significance magnitude factor for an individual token (e.g., the word “interstellar”) used in calculation of the similarity score, it is necessary to multiply its frequency multiplier by its significance factor. In this example (which does not use an actual count of the number of times “interstellar” is used in the US patent dataset), the token “interstellar” is theoretically found 8,000 times in the entire US patent database of 8 million patents (i.e., the aggregate token count of the dataset, which is the second frequency count for the token “interstellar”). The aggregate token count is then divided by the number of documents in the dataset (i.e., the 8 million patents of the US patent database) to find that the frequency of use in the dataset is 8,000/8,000,000 or 1/1000. Significance is the inverse of frequency of use in the dataset, such that the significance factor of the word “interstellar” in this example is 1000. Accordingly, the significance magnitude factor for the word “interstellar” in the '148 patent (i.e. search criteria document) is √8×1000 or nominally 2,828.
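By way of non-limiting illustration, the arithmetic of this example can be reproduced with the short sketch below. The function name and parameter names are hypothetical, and the square root is used as the dampening function because that is what the example uses.

```python
import math


def significance_magnitude(count_in_document, aggregate_count, num_documents):
    """Frequency multiplier times significance factor for one token.

    frequency multiplier = square root of the token's count in the document
    significance factor  = num_documents / aggregate_count, i.e. the inverse
                           of the token's frequency of use in the dataset
    """
    frequency_multiplier = math.sqrt(count_in_document)
    significance_factor = num_documents / aggregate_count
    return frequency_multiplier * significance_factor


# "interstellar": 8 occurrences in the '148 patent, a theoretical 8,000
# occurrences across 8,000,000 patents in the dataset.
value = significance_magnitude(8, 8_000, 8_000_000)
print(round(value, 2))  # 2828.43
```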

Calculation of the frequency multiplier and significance factor is repeated for every token in the '148 patent and for every document in the dataset to determine significance magnitude factors for all tokens of all documents in the dataset with respect to the search criteria. These significance magnitude factors are thereafter used for calculating the overall similarity score between the '148 patent and each document in the dataset (e.g., by summing all of the significance magnitude factors above a certain threshold, by selecting only significance magnitude factors within a designated deviation from the greatest significance magnitude factor, etc.). In this regard, in view of the disclosures made herein, a skilled person will appreciate that the similarity score is a function of a first frequency count (i.e., that characterizes a number of times that each token of the search criteria occurs within the search criteria (i.e., text thereof) and each one of the documents in the source) and a second frequency count (i.e., that characterizes an aggregate number of times that a particular one of the tokens occurs within all of the documents in the dataset).

Turning now to a discussion of approaches for implementing embodiments of the present invention, systems and methods in accordance with embodiments of the present invention can be implemented in any number of different types of data processing systems (e.g., a computer system) in addition to the specific physical implementation of a data processing system in the form of a smart phone, tablet or similar configuration of mobile communication device. To this end, FIG. 12 shows a diagrammatic representation of one embodiment of a computer system 500 within which a set of instructions can execute for causing a device to perform or execute any one or more of the aspects and/or methodologies of the present disclosure. The components in FIG. 12 are examples only and do not limit the scope of use or functionality of any hardware, software, embedded logic component, or a combination of two or more such components implementing particular embodiments.

The computer system 500 can include a processor 501, memory 503, and storage 508 that communicate with each other, and with other components, via a bus 540. The bus 540 can also link a display 532, one or more input devices 533 (which can, for example, include a keypad, a keyboard, a mouse, a stylus, etc.), one or more output devices 534, one or more storage devices 535, and various tangible (e.g., non-transitory) storage media 536. All of these elements can interface directly or via one or more interfaces or adaptors to the bus 540. For instance, the various tangible storage media 536 can interface with the bus 540 via storage medium interface 526. Computer system 500 can have any suitable physical form, including but not limited to one or more integrated circuits (ICs), printed circuit boards (PCBs), mobile communication devices (such as smart phones, tablets, personal digital assistants (PDAs)), laptop or notebook computers, distributed computer systems, computing grids, or servers. All or a portion of the elements 501-536 can be housed in a single unit (e.g., a cell phone housing, a tablet housing, or the like).

Processor(s) 501 (or central processing unit(s) (CPU(s))) optionally contain a cache memory unit 502 for temporary local storage of instructions, data, or computer addresses. Processor(s) 501 are configured to assist in execution of computer readable instructions. Computer system 500 can provide functionality as a result of the processor(s) 501 executing software embodied in one or more tangible (e.g., non-transitory) computer-readable storage media, such as memory 503, storage 508, storage devices 535, and/or storage medium 536. The computer-readable media can store software that implements particular embodiments, and processor(s) 501 can execute the software. Memory 503 can read the software from one or more other computer-readable media (such as mass storage device(s) 535, 536) or from one or more other sources through a suitable interface, such as network interface 520. The software can cause processor(s) 501 to carry out one or more processes or one or more steps of one or more processes described or illustrated herein. Carrying out such processes or steps (i.e., operations) can include defining data structures stored in memory 503 and modifying the data structures as directed by the software.

The memory 503 can include various components (e.g., machine readable media) including, but not limited to, a random access memory component (e.g., RAM 504) (e.g., a static RAM “SRAM”, a dynamic RAM “DRAM”, etc.), a read-only component (e.g., ROM 505), and any combinations thereof. ROM 505 can act to communicate data and instructions unidirectionally to processor(s) 501, and RAM 504 can act to communicate data and instructions bidirectionally with processor(s) 501. ROM 505 and RAM 504 can include any suitable tangible computer-readable media described below. In one example, a basic input/output system 506 (BIOS), including basic routines that help to transfer information between elements within computer system 500, such as during start-up, can be stored in the memory 503.

Fixed storage 508 is connected bidirectionally to processor(s) 501, optionally through storage control unit 507. Fixed storage 508 provides additional data storage capacity and can also include any suitable tangible computer-readable media described herein. Storage 508 can be used to store operating system 509, EXECs 510 (executables), data 511, APV applications 512 (application programs), and the like. Often, although not always, storage 508 is a secondary storage medium (such as a hard disk) that is slower than primary storage (e.g., memory 503). Storage 508 can also include an optical disk drive, a solid-state memory device (e.g., flash-based systems), or a combination of any of the above. Information in storage 508 can, in appropriate cases, be incorporated as virtual memory in memory 503.

In one example, storage device(s) 535 can be removably interfaced with computer system 500 (e.g., via an external port connector (not shown)) via a storage device interface 525. Particularly, storage device(s) 535 and an associated machine-readable medium can provide nonvolatile and/or volatile storage of machine-readable instructions, data structures, program modules, and/or other data for the computer system 500. In one example, software can reside, completely or partially, within a machine-readable medium on storage device(s) 535. In another example, software can reside, completely or partially, within processor(s) 501.

Bus 540 connects a wide variety of subsystems. Herein, reference to a bus can encompass one or more digital signal lines serving a common function, where appropriate. Bus 540 can be any of several types of bus structures including, but not limited to, a memory bus, a memory controller, a peripheral bus, a local bus, and any combinations thereof, using any of a variety of bus architectures. As an example and not by way of limitation, such architectures include an Industry Standard Architecture (ISA) bus, an Enhanced ISA (EISA) bus, a Micro Channel Architecture (MCA) bus, a Video Electronics Standards Association local bus (VLB), a Peripheral Component Interconnect (PCI) bus, a PCI-Express (PCIe) bus, an Accelerated Graphics Port (AGP) bus, a HyperTransport (HTX) bus, a serial advanced technology attachment (SATA) bus, and any combinations thereof.

Computer system 500 can also include an input device 533. In one example, a user of computer system 500 can enter commands and/or other information into computer system 500 via input device(s) 533. Examples of an input device(s) 533 include, but are not limited to, an alpha-numeric input device (e.g., a keyboard), a pointing device (e.g., a mouse or touchpad), a joystick, a gamepad, an audio input device (e.g., a microphone, a voice response system, etc.), an optical scanner, a video or still image capture device (e.g., a camera), and any combinations thereof. Input device(s) 533 can be interfaced to bus 540 via any of a variety of input interfaces 523 (e.g., input interface 523) including, but not limited to, serial, parallel, game port, USB (universal serial bus), FIREWIRE, THUNDERBOLT, or any combination of the above.

In particular embodiments, when computer system 500 is connected to network 530, computer system 500 can communicate with other devices, specifically mobile devices and enterprise systems, connected to network 530. Communications to and from computer system 500 can be sent through network interface 520. For example, network interface 520 can receive incoming communications (such as requests or responses from other devices) in the form of one or more packets (such as Internet Protocol (IP) packets) from network 530, and computer system 500 can store the incoming communications in memory 503 for processing. Computer system 500 can similarly store outgoing communications (such as requests or responses to other devices) in the form of one or more packets in memory 503 to be communicated to network 530 from network interface 520. Processor(s) 501 can access these communication packets stored in memory 503 for processing.

Examples of the network interface 520 include, but are not limited to, a network interface card, a modem, and any combination thereof. Examples of a network 530 or network segment 530 include, but are not limited to, a wide area network (WAN) (e.g., the Internet, an enterprise network), a local area network (LAN) (e.g., a network associated with an office, a building, a campus or other relatively small geographic space), a telephone network, a direct connection between two computing devices, and any combinations thereof. A network, such as network 530, can employ a wired and/or a wireless mode of communication. In general, any network topology can be used.

Information and data can be displayed through a display 532. Examples of a display 532 include, but are not limited to, a liquid crystal display (LCD), an organic light-emitting diode (OLED) display, a cathode ray tube (CRT), a plasma display, and any combinations thereof. The display 532 can interface to the processor(s) 501, memory 503, and fixed storage 508, as well as other devices, such as input device(s) 533, via the bus 540. The display 532 is linked to the bus 540 via a video interface 522, and transport of data between the display 532 and the bus 540 can be controlled via the graphics control 521.

In addition to a display 532, computer system 500 can include one or more other peripheral output devices 534 including, but not limited to, an audio speaker, a printer, and any combinations thereof. Such peripheral output devices can be connected to the bus 540 via an output interface 524. Examples of an output interface 524 include, but are not limited to, a serial port, a parallel connection, a USB port, a FIREWIRE brand port, a THUNDERBOLT brand port, a LIGHTNING brand port, and any combinations and/or connectors thereof.

In addition or as an alternative, computer system 500 can provide functionality as a result of logic hardwired or otherwise embodied in a circuit, which can operate in place of or together with software to execute one or more processes or one or more steps of one or more processes described or illustrated herein. Reference to software in this disclosure can encompass logic, and reference to logic can encompass software. Moreover, reference to a computer-readable medium (also sometimes referred to as a “machine-readable medium”) can encompass a circuit (such as an IC) storing software for execution, a circuit embodying logic for execution, or both, where appropriate. The present disclosure encompasses any suitable combination of hardware, software, or both.

The term “computer-readable medium” should be understood to include any structure that participates in providing data that can be read by an element of a computer system. Such a medium can take many forms, including but not limited to, non-volatile media, volatile media, and transmission media. Non-volatile media include, for example, optical or magnetic disks and other persistent memory. Volatile media include dynamic random access memory (DRAM) and/or static random access memory (SRAM). Transmission media include cables, wires, and fibers, including the wires that comprise a system bus coupled to a processor. Common forms of machine-readable media include, for example, a floppy disk, a flexible disk, a hard disk, a magnetic tape, any other magnetic medium, a CD-ROM, a DVD, and any other optical medium.

Those of skill in the art would understand that information and signals can be represented using any of a variety of different technologies and techniques. For example, data, instructions, commands, information, signals, bits, symbols, and chips that can be referenced throughout the above description can be represented by voltages, currents, electromagnetic waves, magnetic fields or particles, optical fields or particles, or any combination thereof.

Those of skill would further appreciate that the various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with the embodiments disclosed herein can be implemented as electronic hardware, computer software, or combinations of both. To clearly illustrate this interchangeability of hardware and software, various illustrative components, blocks, modules, circuits, and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the overall system. Skilled artisans can implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.

The various illustrative logical blocks, modules, and circuits described in connection with the embodiments disclosed herein can be implemented or performed with a general purpose processor, a digital signal processor (DSP), an application specific integrated circuit (ASIC), a field programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or any combination thereof designed to perform the functions described herein. A general purpose processor can be a microprocessor, but in the alternative, the processor can be any conventional processor, controller, microcontroller, or state machine. A processor can also be implemented as a combination of computing devices, e.g., a combination of a DSP and a microprocessor, a plurality of microprocessors, one or more microprocessors in conjunction with a DSP core, or any other such configuration.

The steps of a method or algorithm described in connection with the embodiments disclosed herein can be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module can reside in RAM memory, flash memory, ROM memory, EPROM memory, EEPROM memory, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art. An exemplary storage medium is coupled to the processor such that the processor can read information from, and write information to, the storage medium. In the alternative, the storage medium can be integral to the processor. The processor and the storage medium can reside in an ASIC. The ASIC can reside in a user terminal. In the alternative, the processor and the storage medium can reside as discrete components in a user terminal.

The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein can be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

It will be recognized by those skilled in the art that changes or modifications may be made to the above-described embodiments of the present invention without departing from the broad inventive concepts of the invention. It should therefore be understood that this invention is not limited to the particular embodiments of the present invention described herein, but is of the invention as set forth in the claims.

1-26. (canceled)
27: A method of semantically searching documents in a way that improves the efficiency of computer resources, comprising: indexing by a processor a data set of documents having words by counting the words in the entire data set and determining a first frequency score and a first uniqueness score for each word of the data set of documents; receiving a user input by the processor of a document of interest; determining by the processor a second frequency score and a second uniqueness score for each word in the document of interest; generating by the processor a respective similarity score for the document of interest compared to each of the documents in the data set of documents in a flat manner by comparing the second frequency score and the second uniqueness score for each word in the document of interest to the first frequency score and the first uniqueness score for each word of the data set of documents; and presenting by the processor the most similar documents from the data set to the document of interest using the respective similarity score for the document of interest compared to each of the documents in the data set of documents.
28: The method of claim 27, wherein the presenting includes sorting by the processor the respective similarity score to obtain at least one most similar document of the data set of documents to the document of interest and displaying a ranked listing of the at least one most similar document.
29: The method of claim 28, further comprising: receiving a next user input to the processor of the at least one most similar document of the data set of documents; determining by the processor a third frequency score and a third uniqueness score for each word in the at least one most similar document; generating by the processor a respective second similarity score for the at least one most similar document compared to each of the documents in the data set of documents in a flat manner by comparing the third frequency score and the third uniqueness score for each word in the at least one most similar document to the first frequency score and the first uniqueness score for each word of the data set of documents; and presenting by the processor the most similar documents from the data set to the at least one most similar document using the respective second similarity score for the at least one most similar document compared to each of the documents in the data set of documents.
30: The method of claim 27, wherein: the user input includes a uniform resource locator (URL); and receiving the user input includes accessing by the processor information residing at a location designated by the URL.
31: The method of claim 27, further comprising: normalizing by the processor the respective similarity score for the document of interest compared to each of the documents of the data set of documents.
32: A system for semantically searching documents to improve efficiency of computer resources, comprising: a memory containing a set of instructions; and a processor for processing the set of instructions, wherein the instructions cause the processor to perform a method comprising: indexing a data set of documents having words by counting the words in the entire data set and determining a first frequency score and a first uniqueness score for each word of the data set of documents; receiving a user input of a document of interest; determining a second frequency score and a second uniqueness score for each word in the document of interest; generating a respective similarity score for the document of interest compared to each of the documents in the data set of documents in a flat manner by comparing the second frequency score and the second uniqueness score for each word in the document of interest to the first frequency score and the first uniqueness score for each word of the data set of documents; and presenting the most similar documents from the data set to the document of interest using the respective similarity score for the document of interest compared to each of the documents in the data set of documents.
33: The system of claim 32, wherein presenting includes sorting the respective similarity score to obtain at least one most similar document of the data set of documents to the document of interest and displaying a ranked listing of the at least one most similar document.
34: The system of claim 33, wherein the instructions cause the processor to perform a method further comprising: receiving a next user input of the at least one most similar document of the data set of documents; determining a third frequency score and a third uniqueness score for each word in the at least one most similar document; generating a respective second similarity score for the at least one most similar document compared to each of the documents in the data set of documents in a flat manner by comparing the third frequency score and the third uniqueness score for each word in the at least one most similar document to the first frequency score and the first uniqueness score for each word of the data set of documents; and presenting the most similar documents from the data set to the at least one most similar document using the respective second similarity score for the at least one most similar document compared to each of the documents in the data set of documents.
35: The system of claim 32, wherein: the user input includes a uniform resource locator (URL); and receiving the user input includes accessing information residing at a location designated by the URL.
36: The system of claim 32, wherein the instructions cause the processor to perform a method further comprising: normalizing the respective similarity score for the document of interest compared to each of the documents of the data set of documents.
37: A non-transitory computer-readable medium having tangibly embodied thereon and accessible therefrom processor-executable instructions that, when executed by at least one data processing device of at least one computer, cause said at least one data processing device to perform a method comprising: indexing a data set of documents having words by counting the words in the entire data set and determining a first frequency score and a first uniqueness score for each word of the data set of documents; receiving a user input of a document of interest; determining a second frequency score and a second uniqueness score for each word in the document of interest; generating a respective similarity score for the document of interest compared to each of the documents in the data set of documents in a flat manner by comparing the second frequency score and the second uniqueness score for each word in the document of interest to the first frequency score and the first uniqueness score for each word of the data set of documents; and presenting the most similar documents from the data set to the document of interest using the respective similarity score for the document of interest compared to each of the documents in the data set of documents.
38: The non-transitory computer readable medium of claim 37, wherein the presenting includes sorting the respective similarity score to obtain at least one most similar document of the data set of documents to the document of interest and displaying a ranked listing of the at least one most similar document.
39: The non-transitory computer readable medium of claim 38, wherein the method further comprises: receiving a next user input to the processor of the at least one most similar document of the data set of documents; determining by the processor a third frequency score and a third uniqueness score for each word in the at least one most similar document; generating by the processor a respective second similarity score for the at least one most similar document compared to each of the documents in the data set of documents in a flat manner by comparing the third frequency score and the third uniqueness score for each word in the at least one most similar document to the first frequency score and the first uniqueness score for each word of the data set of documents; and presenting by the processor the most similar documents from the data set to the at least one most similar document using the respective second similarity score for the at least one most similar document compared to each of the documents in the data set of documents.
40: The non-transitory computer readable medium of claim 37, wherein: the user input includes a uniform resource locator (URL); and receiving the user input includes accessing by the processor information residing at a location designated by the URL.
41: The non-transitory computer readable medium of claim 37, wherein the method further comprises: normalizing by the processor the respective similarity score for the document of interest compared to each of the documents of the data set of documents.
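By way of illustration only, the indexing, scoring, and ranking steps recited in the claims above might be sketched as follows. The claims do not fix the scoring formulas, so this sketch makes several assumptions: a word's frequency score is taken to be its relative frequency within a document, the uniqueness score is a smoothed inverse-document-frequency stand-in, and the similarity score is a cosine similarity, which also yields a score normalized to the range 0 to 1. The function names (`tokenize`, `index_data_set`, `rank_similar`) are hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    # Hypothetical tokenizer: lowercase words with surrounding punctuation removed.
    words = (w.strip(".,;:!?\"'()").lower() for w in text.split())
    return [w for w in words if w]


def weighted_vector(tokens, uniqueness):
    # Frequency score (assumed: relative frequency) times uniqueness score.
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: (c / total) * uniqueness.get(w, 0.0) for w, c in counts.items()}


def index_data_set(documents):
    # Count words across the entire data set and derive a uniqueness score
    # per word (assumed: smoothed inverse document frequency).
    doc_tokens = [tokenize(d) for d in documents]
    doc_freq = Counter(w for toks in doc_tokens for w in set(toks))
    n = len(documents)
    uniqueness = {w: math.log((1 + n) / (1 + df)) + 1.0 for w, df in doc_freq.items()}
    vectors = [weighted_vector(toks, uniqueness) for toks in doc_tokens]
    return vectors, uniqueness


def cosine(a, b):
    # Cosine similarity between two sparse word-weight vectors.
    dot = sum(v * b.get(w, 0.0) for w, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_similar(document_of_interest, documents, top_k=3):
    # Score every document in one flat pass, then sort into a ranked listing
    # of (document index, similarity score) pairs.
    vectors, uniqueness = index_data_set(documents)
    query = weighted_vector(tokenize(document_of_interest), uniqueness)
    scored = [(cosine(query, vec), i) for i, vec in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(i, round(score, 3)) for score, i in scored[:top_k]]
```

Under these assumptions, passing a top-ranked document back in as the next `document_of_interest` would correspond to the follow-up search of claims 29, 34, and 39. A production system would index the data set once rather than on every call, as this sketch does for brevity.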